Regex Extension: Character Classes

Part of the Regex Documentation website.

Character Classes

The ability to match a class, or collection, of characters at a specific point in a target string lets a single pattern match a range of text.
A pattern can specify a class of characters to match in one of three ways:
    the dot metacharacter;
    character classes;
    class shorthands.

dot -- the match-any-character class (.)
A class that matches any character except the null character '\0'. Since it matches almost any character, it is the most general of all possible character classes.
Example
regex.easyMatch ("c.t","catheter")
   » true
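
For readers without Frontier at hand, the same match can be sketched with Python's standard "re" module. This is only a stand-in under stated assumptions: the helper easy_match is hypothetical and uses re.search to mimic what regex.easyMatch appears to do in the example above (report whether the pattern matches anywhere in the string). Python's dot also differs in one detail, since it skips the newline character unless re.DOTALL is set.

```python
import re

def easy_match(pattern: str, text: str) -> bool:
    """Hypothetical stand-in for regex.easyMatch: True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# "c.t" finds "cat" at the start of "catheter".
print(easy_match("c.t", "catheter"))

# The dot must consume exactly one character, so "ct" alone does not match.
print(easy_match("c.t", "ct"))
```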

Character classes ([...])
A character class, also known as a "list" or "bracket expression", is a list of one or more items. The list consists of the items between the square brackets, "[...]".
An item in a character class can be either an ordinary character, representing itself, or a metacharacter. However, metacharacters are defined differently inside a character class than outside one.

Example
"[abc]" matches either "a" or "b" or "c".
"Defen[sc]e" will match either "Defense" or "Defence"
If you want to include a "]" in a character list, either include it as the first character (eg "[]]"), or escape it using a backslash (eg "[\\]]").
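
The examples above can be tried out in Python's "re" module, used here as an assumed stand-in for the Frontier regex engine (the two engines may differ in details). Raw strings (r"...") are a Python convenience that avoids doubling the backslash.

```python
import re

# "[abc]" matches exactly one character: "a", "b", or "c".
print(re.findall(r"[abc]", "cabbage"))

# "Defen[sc]e" accepts either spelling, but no other letter in that position.
spellings = ["Defense", "Defence", "Defenze"]
accepted = [w for w in spellings if re.fullmatch(r"Defen[sc]e", w)]
print(accepted)

# Two ways to put a literal "]" in the list: first position, or escaped.
first_position = re.fullmatch(r"[]]", "]") is not None
escaped = re.fullmatch(r"[\]]", "]") is not None
print(first_position, escaped)
```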

character-class metacharacters
Character classes have their own rules for what are and what aren't metacharacters. Something that is a metacharacter outside of a character class may not be a metacharacter inside a character class.
For example, the dot metacharacter is just a plain dot inside a character class.

- the dash The dash indicates a range of characters. A range is formed by placing a dash between two characters. The range covers the characters from the starting element through the ending element in the ASCII sequence.
Examples

"[a-z]" is equivalent to "[abcdefghijklmnopqrstuvwxyz]"
"[0-9]" is the same as "[0123456789]"
"<H1>[a-zA-Z0-9 ]+</H1>" may match a level 1 heading in HTML code.

Cases when the dash is not a metacharacter inside a character class:
        the dash is the first or last character in the list;
        the dash is the last character in a range;
        the dash is escaped with a backslash "\".
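
A quick way to check what a range covers is to enumerate the ASCII characters a class accepts. The sketch below does this in Python's "re" module, an assumed stand-in for the Frontier engine; the helper name members is hypothetical. It also checks three spellings of a literal dash: first in the list, last in the list, and escaped.

```python
import re
import string

def members(char_class: str) -> str:
    """Return every ASCII character (codes 0-127) that the given class matches."""
    return "".join(chr(i) for i in range(128) if re.fullmatch(char_class, chr(i)))

# Ranges expand to the full run of characters between their end points.
print(members(r"[a-z]") == string.ascii_lowercase)
print(members(r"[0-9]") == string.digits)

# The heading pattern accepts letters, digits, and spaces between the tags.
print(re.fullmatch(r"<H1>[a-zA-Z0-9 ]+</H1>", "<H1>Chapter 1</H1>") is not None)

# A dash that is first, last, or escaped is an ordinary character.
print(members(r"[-az]"), members(r"[az-]"), members(r"[a\-z]"))
```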

^ the caret If the caret is the first element in the list, the character class matches any character that is not in the list.
[^...] classes are known as negated character classes.
Examples

"[^a-z]" matches any character that is not a lowercase letter.
"<!--[^>]+-->" will match HTML comments that contain no ">" - "[^>]+" matches one or more characters that are not ">".

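Negated classes can be explored the same way in Python's "re" module (an assumed stand-in for the Frontier engine). The comment pattern here assumes the standard HTML terminator "-->", and the last line shows its limitation: a ">" inside the comment stops "[^>]+" and the match fails.

```python
import re

# Every character that is not a lowercase ASCII letter.
print(re.findall(r"[^a-z]", "abc-123 X"))

comment = re.compile(r"<!--[^>]+-->")

# A simple comment is found in the surrounding markup.
print(comment.search("<p><!-- note --></p>").group(0))

# A ">" inside the comment body prevents the match (prints None).
print(comment.search("<!-- a > b -->"))
```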
\ the escape The escape allows character class metacharacters to be represented as themselves.
When using an escape in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
Example
"[,\\-\\]]" matches a comma, a dash, or a closing square bracket.

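The double backslash is a property of the host language's string literals, not of the pattern itself. Python behaves the same way for ordinary strings, so it makes a convenient illustration (the "re" module is an assumed stand-in for the Frontier engine): the doubled form and the raw-string form are the same pattern.

```python
import re

# In an ordinary Python string the backslash is doubled, as in the Frontier example;
# a raw string spells the same pattern with single backslashes.
doubled = "[,\\-\\]]"
raw = r"[,\-\]]"
print(doubled == raw)

# The class picks out the comma, the dash, and the closing bracket.
print(re.findall(raw, "a,b-c]d"))
```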
[:...:] POSIX bracket expressions A POSIX* bracket expression** contains one of several special class shortcuts. These character shortcuts are only valid within character classes.

Examples
regex.easyMatch ("[[:alpha:]]", "Ë")
   » true
regex.easyMatch ("[:alpha:]", "Ë")
   » false - because it attempts to match the class ":", "a", "l", "p" and "h" against "Ë".
The supported POSIX character shortcuts are:

alnum letters (including diacritical characters) and digits.

alpha letters (including diacritical characters).

blank a space or tab.

cntrl control characters in the ASCII encoding (ie codes less than 32 and code 127).

digit digits - 0123456789.

graph same as "print" except omits space.

lower lowercase letters - including diacritical characters.

print printable characters (in the ASCII encoding, space through tilde: codes 32 through 126).

punct neither control nor alphanumeric characters.

space space, carriage return, newline, tab, and form feed.

upper uppercase letters - including diacritical characters.

xdigit hexadecimal digits: "0"-"9", "a"-"f", "A"-"F".
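
Engines without POSIX shortcuts can approximate some of them by expanding each name into an explicit list. Python's "re" module is one such engine, so the sketch below rewrites a pattern before compiling it. Everything here is an assumption for illustration: the table POSIX_ASCII and the helper expand_posix are hypothetical, only a few names are covered, and the expansions are ASCII-only, so diacritical characters such as "Ë" are not matched as they are in the list above.

```python
import re

# Hypothetical ASCII-only expansions; the shortcuts described above
# also cover diacritical characters, which these do not.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] with its ASCII expansion; unknown names are left untouched."""
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: POSIX_ASCII.get(m.group(1), m.group(0)),
        pattern,
    )

print(expand_posix("[[:alpha:]]"))
print(expand_posix("[[:digit:][:upper:]]+"))
print(re.fullmatch(expand_posix("[[:xdigit:]]+"), "1aF") is not None)
```

Because the shortcuts are only valid inside a character class, the helper leaves the outer brackets alone and replaces just the inner "[:name:]" part.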

Class shorthands
Class shorthands are shortcuts for a character class.
When using an escape, "\", in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.

\d Digit Match any digit. It is equivalent to "[0-9]".

\D Non-digit Match any character that is not a digit. It is equivalent to "[^0-9]".

\s Whitespace Match any whitespace character - horizontal tab, line feed, vertical tab, form feed, carriage return and space.

\S Non-whitespace Match any character that is not whitespace.

\w Word character Match any character that can be part of a word. It is similar to "[a-zA-Z0-9_]" except that it also includes all characters with diacritic marks.

\W Non-word character Match any character that cannot be part of a word. It is similar to "[^a-zA-Z0-9_]" except that it also excludes all characters with diacritic marks.
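
The shorthands can be tried in Python's "re" module, used here as an assumed stand-in for the Frontier engine. Two differences are worth knowing: in Python, "\w" on text strings accepts letters with diacritic marks by default (much like the description above), and passing re.ASCII restricts it to "[a-zA-Z0-9_]"; Python's default "\d" also accepts digits from other scripts, so it is only equivalent to "[0-9]" under re.ASCII.

```python
import re

# \d and \D: digits and non-digits.
print(re.findall(r"\d+", "Room 101, floor 3"))
print(re.sub(r"\D", "", "555-0199"))

# \s and \S: whitespace and non-whitespace.
print(re.split(r"\s+", "one two\tthree\nfour"))
print(re.findall(r"\S+", "one two\tthree"))

# \w and \W: word and non-word characters ("naïve café_1!" written with escapes).
words = "na\u00efve caf\u00e9_1!"
print(re.findall(r"\w+", words))
print(re.findall(r"\w+", words, flags=re.ASCII))
print(re.findall(r"\W", words))
```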

* POSIX is short for Portable Operating System Interface, a standard for ensuring portability across operating systems.
** Actually, a POSIX bracket expression is what we call a character class, and POSIX uses the term "character class" for the metasequences inside a bracket expression. We'll stick with the standard regular expression nomenclature.

This page was last updated at Sun, 08 Nov 1998 18:02:25 GMT.
Please send all questions and comments to regex@lists.scriptmeridian.org.
Check our website for updates to the docs.
